Description

For your final project you are asked to analyse UK house price data from the Land Registry [1,2]. The entire dataset (4 GB), a subset of the dataset [3] (for use if you cannot process the entire 4 GB dataset), and a description of the columns are also available here [4]. See also the attached “starting-script.r”. The exercise is divided into the following tasks:

Task A:

A1. For the 33 London boroughs create a box-plot (or several box-plots) that compares house prices between the boroughs. Can you think of a better way to compare borough house prices (please demonstrate)?

A2. Could the entire dataset be used to estimate the relationship between price of flats and floor level? If yes, how would you show that relationship in a plot?

Task B:

B1. Create a GeoJSON file where each postcode is represented with a latitude, longitude value, together with minimum, maximum, mean and median house price.

B2. Open the GeoJSON file in the GIS application of your choice and colour-code the data to give an overview of areas with high, medium and low median house price. Additionally, you can visualise this information as choropleths, or use Shiny and add the information as markers on a map for a more interactive and impressive result.

B3. Instead of using median price, you could have been asked to colour-code the mean house price. Would that have given a better view of the house prices across the UK? Please justify your answer.

Task C:

C1. Examine the house prices for 2015. How do these change over time? Do property prices seem to increase or decrease throughout the year?

C2. Is there a significant relationship between the price of a property and the time of year it is sold? Does this vary with type of property?

Solution:

We first load the libraries we will be using during the exercise

# Load libraries
library(Hmisc, quietly=TRUE)
library(dplyr, quietly=TRUE)
library(ggplot2, quietly=TRUE)
library(scales, quietly=TRUE)
library(maps, quietly = TRUE)
library(leaflet, quietly = TRUE)
library(sp, quietly = TRUE)
library(maptools, quietly = TRUE)
library(mapdata, quietly = TRUE)
library(geojsonio, quietly = TRUE)
library(spdplyr, quietly = TRUE)
library(stringr, quietly = TRUE)
library(dygraphs, quietly = TRUE)
library(xts, quietly = TRUE)
library(viridis, quietly = TRUE)
library(htmlwidgets, quietly = TRUE)
library(IRdisplay, quietly = TRUE)
library(gridExtra, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)

Then we load the two files we will require: ppdata and ukpostcodes. We will be using the lite version of ppdata to improve performance in this exercise.

NOTE: The following chunk only needs to be executed the first time. After that we will be loading the rds data file produced from reading the files and joining them together.

# Load price paid data.
#load("data/ppdata")
#ppdata <- read.csv("data/ppdata_lite.csv", header = TRUE, sep = ',')

# Load file with postcodes and latitude/longitude
#ukpostcodes <- read.csv("data/ukpostcodes.csv", header = TRUE, sep = ',')

# Merge postcodes and ppdata
#ppdata <- merge(ppdata, ukpostcodes, by = "postcode")

# Save a single object to a file
#saveRDS(ppdata, "ppdata.rds")
# Reload the data from saved rds file
ppdata <- readRDS("ppdata.rds")
# display preview of the joined data
head(ppdata,10)
##    postcode          transaction_unique_identifier  price date_of_transfer
## 1   AL1 1AJ {F42AB372-96E9-4C62-A659-77AC8C0EAEDF} 159995 1997-07-25 00:00
## 2   AL1 1AJ {4911E580-FE15-43E0-9C80-3A36863ABBC7} 300000 2006-09-22 00:00
## 3   AL1 1AJ {656892E1-43C7-4CE5-83A3-28DA979FB9F4} 143995 1997-06-20 00:00
## 4   AL1 1AJ {B3BA577D-02A6-40BA-9C17-2167A5C27794} 435000 2011-06-30 00:00
## 5   AL1 1AJ {A0278D4E-0B75-48A7-AAA2-1587E9A7F9BE} 158950 1997-10-27 00:00
## 6   AL1 1AJ {6A55C1DF-1633-493F-A38B-3204615D3EE5} 139995 1997-06-27 00:00
## 7   AL1 1AJ {D8D977D1-2B0B-46B0-BA6E-187A9ECE8D18} 159995 1997-06-27 00:00
## 8   AL1 1AJ {932A3304-6558-43E4-B4B2-3E3FFFF9CB6E} 101950 1997-03-27 00:00
## 9   AL1 1AJ {2845B4B3-AE0E-41F7-842E-C4B9A1E4FFA9} 159995 1997-07-25 00:00
## 10  AL1 1AJ {81C267F5-65C1-4F88-B629-C7E1D8AD74C4} 310000 2006-06-01 00:00
##    property_type old_new duration PAON SAON       street   locality  town_city
## 1              S       Y        F   28      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 2              T       N        F   21      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 3              S       Y        F   17      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 4              T       N        F   22      ORIENT CLOSE             ST ALBANS
## 5              S       Y        F    2      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 6              S       Y        F   20      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 7              T       Y        F   21      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 8              S       Y        F   49      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 9              S       Y        F   28      ORIENT CLOSE             ST ALBANS
## 10             S       N        F    5      ORIENT CLOSE ST. ALBANS ST. ALBANS
##     district        county PPD_category_type record_status      id latitude
## 1  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 2  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 3  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 4  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 5  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 6  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 7  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 8  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 9  ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 10 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
##     longitude
## 1  -0.3285997
## 2  -0.3285997
## 3  -0.3285997
## 4  -0.3285997
## 5  -0.3285997
## 6  -0.3285997
## 7  -0.3285997
## 8  -0.3285997
## 9  -0.3285997
## 10 -0.3285997
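The commented-out chunk above implements a simple cache-on-first-run pattern: build and save the merged data once, then reload the RDS file on subsequent runs. A self-contained sketch of the same idea, using an illustrative two-row data frame and a temporary file (the file name here is hypothetical):

```r
# Cache-on-first-run: do the expensive step only if the cache is absent.
cache_file <- file.path(tempdir(), "ppdata_demo.rds")

if (!file.exists(cache_file)) {
  # In the real analysis this is where the CSVs are read and merged.
  demo <- data.frame(postcode = c("AL1 1AJ", "AL1 1AR"),
                     price    = c(159995, 300000))
  saveRDS(demo, cache_file)
}

# Every run (first or later) ends up reading from the cache.
demo <- readRDS(cache_file)
nrow(demo)
```

Reading a single serialized RDS file is much faster than re-reading and re-merging the raw CSVs on every run.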

To understand the dataset better, we print a description of the entire dataset using the Hmisc library. The output gives us valuable information that we will use throughout the exercise.

Some things that initially stand out:

- There is quite a large number of observations (>2M); it will be interesting to see how the graphs perform with the raw data.
- Some variables have missing values; we will need to be aware of those during the exercise.
- There are definitely some outliers in the price feature; we will decide how to treat them case by case.

# Run a full description of the dataset
describe(ppdata)
## ppdata 
## 
##  19  Variables      2494557  Observations
## --------------------------------------------------------------------------------
## postcode 
##        n  missing distinct 
##  2494557        0   818952 
## 
## lowest : AL1 1AJ  AL1 1AR  AL1 1AS  AL1 1BH  AL1 1BX 
## highest: YO8 9XN  YO8 9XP  YO8 9YA  YO8 9YD  YO90 1WR
## --------------------------------------------------------------------------------
## transaction_unique_identifier 
##        n  missing distinct 
##  2494557        0  2494557 
## 
## lowest : {000000FE-94CA-47DA-8D75-6FDFA5960D75} {00000E84-0C74-41E3-B236-27DC73CF7600} {00001B92-1FAE-49C7-8322-DAB17FBD4D19} {000026CE-CA8C-4A26-9136-62D07F0FB88F} {0000290A-BFB7-497A-88FE-83FFDDCDCDB5}
## highest: {FFFFE6DF-3996-4FE1-B76A-A38B8ADC801F} {FFFFEFE8-1FE2-4E62-B96F-80164E5D9C78} {FFFFF44B-C0A6-41E0-BD12-EDB19583BE1F} {FFFFF632-335D-4535-B6A5-8A942CF167D9} {FFFFFC82-7D94-402B-8F84-8EEADD004339}
## --------------------------------------------------------------------------------
## price 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##  2494557        0    40742        1   169655   146156    34000    44995 
##      .25      .50      .75      .90      .95 
##    72500   126000   202000   314000   420000 
## 
## lowest :        1       11       20       40       95
## highest: 75000000 80000000 96000000 96652091 97500000
## --------------------------------------------------------------------------------
## date_of_transfer 
##        n  missing distinct 
##  2494557        0     7462 
## 
## lowest : 1995-01-01 00:00 1995-01-02 00:00 1995-01-03 00:00 1995-01-04 00:00 1995-01-05 00:00
## highest: 2016-03-25 00:00 2016-03-27 00:00 2016-03-29 00:00 2016-03-30 00:00 2016-03-31 00:00
## --------------------------------------------------------------------------------
## property_type 
##        n  missing distinct 
##  2494557        0        5 
## 
## lowest : D F O S T, highest: D F O S T
##                                              
## Value           D      F      O      S      T
## Frequency  573181 449566   4002 695273 772535
## Proportion  0.230  0.180  0.002  0.279  0.310
## --------------------------------------------------------------------------------
## old_new 
##        n  missing distinct 
##  2494557        0        2 
##                           
## Value            N       Y
## Frequency  2242783  251774
## Proportion   0.899   0.101
## --------------------------------------------------------------------------------
## duration 
##        n  missing distinct 
##  2494557        0        3 
##                                   
## Value            F       L       U
## Frequency  1910492  583991      74
## Proportion   0.766   0.234   0.000
## --------------------------------------------------------------------------------
## PAON 
##        n  missing distinct 
##  2494053      504   120578 
## 
## lowest : (AKERS)          (ANNINGS)        (AVEY)           (BELL)           (BEVERLEY)      
## highest: ZOE COURT, 26    ZOLOTY RO KEE, 7 ZONDELI          ZONE A           ZWARTKOPS       
## --------------------------------------------------------------------------------
## SAON 
##        n  missing distinct 
##   279072  2215485     8944 
## 
## lowest : (ADAMS)         (ANDERTON)      (BENDEL)        (BIELBY)        (COLLIN)       
## highest: YORK HOUSE      YORK HOUSE FLAT YR YSGUBOR      ZADANIE COTTAGE ZONE 7, LEVEL 3
## --------------------------------------------------------------------------------
## street 
##        n  missing distinct 
##  2458323    36234   230009 
## 
## lowest : 12TH AVENUE  14TH AVENUE  16TH AVENUE  18TH AVENUE  19TH AVENUE 
## highest: ZULU MEWS    ZULU ROAD    ZURA AVENUE  ZURA DRIVE   ZURICH CLOSE
## --------------------------------------------------------------------------------
## locality 
##        n  missing distinct 
##  1829283   665274    19375 
## 
## lowest : AB KETTLEBY     ABBERLEY        ABBERTON        ABBESS RODING   ABBEY GATE     
## highest: ZEAL MONACHORUM ZEALS           ZELAH           ZENNOR          ZOUCH          
## --------------------------------------------------------------------------------
## town_city 
##        n  missing distinct 
##  2494557        0     1165 
## 
## lowest : ABBOTS LANGLEY ABERAERON      ABERDARE       ABERDOVEY      ABERGAVENNY   
## highest: YATELEY        YELVERTON      YEOVIL         YORK           YSTRAD MEURIG 
## --------------------------------------------------------------------------------
## district 
##        n  missing distinct 
##  2494557        0      452 
## 
## lowest : ABERCONWY                 ADUR                      ALLERDALE                 ALNWICK                   ALYN AND DEESIDE         
## highest: WYCOMBE                   WYRE                      WYRE FOREST               YNYS MON-ISLE OF ANGLESEY YORK                     
## --------------------------------------------------------------------------------
## county 
##        n  missing distinct 
##  2494557        0      127 
## 
## lowest : AVON                         BATH AND NORTH EAST SOMERSET BEDFORD                      BEDFORDSHIRE                 BERKSHIRE                   
## highest: WOKINGHAM                    WORCESTERSHIRE               WREKIN                       WREXHAM                      YORK                        
## --------------------------------------------------------------------------------
## PPD_category_type 
##        n  missing distinct 
##  2494557        0        2 
##                           
## Value            A       B
## Frequency  2473246   21311
## Proportion   0.991   0.009
## --------------------------------------------------------------------------------
## record_status 
##        n  missing distinct    value 
##  2494557        0        1        A 
##                   
## Value            A
## Frequency  2494557
## Proportion       1
## --------------------------------------------------------------------------------
## id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##  2494557        0   818952        1   856191   585464    87956   175931 
##      .25      .50      .75      .90      .95 
##   415743   825497  1331261  1555721  1641505 
## 
## lowest :     410     416     417     419     420
## highest: 1812444 1812455 1812456 1812458 1812459
## --------------------------------------------------------------------------------
## latitude 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##  2494557        0   815897        1    52.28    1.271    50.78    50.95 
##      .25      .50      .75      .90      .95 
##    51.44    51.94    53.28    53.78    54.34 
## 
## lowest : 49.76682 49.91333 49.91383 49.91393 49.91398
## highest: 55.78478 55.78510 55.78600 55.78674 55.78707
## --------------------------------------------------------------------------------
## longitude 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##  2494557        0   815925        1   -1.272    1.481  -3.4872  -2.9880 
##      .25      .50      .75      .90      .95 
##  -2.1802  -1.2727  -0.2022   0.2567   0.7068 
## 
## lowest : -7.557163 -6.352647 -6.319136 -6.318502 -6.318090
## highest:  1.756664  1.756813  1.757012  1.757150  1.758131
## --------------------------------------------------------------------------------

Task A1

A1. For the 33 London boroughs create a box-plot (or several box-plots) that compares house prices between the boroughs. Can you think of a better way to compare borough house prices (please demonstrate)?

We believe that the boroughs correspond to the district column in the dataset. To verify this, we filter the dataset to the town/city of London and count the distinct districts.

# Select only london city and print out the unique number of districts
london_ppdata0 = ppdata %>% filter(town_city=='LONDON')
length(unique(london_ppdata0$district))
## [1] 33

The count of 33 confirms our earlier suspicion, so we now create the boxplot required by the question.

london_ppgraph0 = ggplot(data=london_ppdata0, aes(x=district, y=price)) +
  labs(x = 'District / Borough', y = 'Price', title = 'London house price by district') +
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0), legend.position = 'bottom') + 
  scale_y_continuous(labels = comma)

london_ppgraph0

In the boxplot above we see a very large spread in price values, which makes the boxes essentially invisible. This is probably caused by a few extremely high prices (potential outliers).

To address this, we apply a log transformation to the price axis so the details of the boxplots become visible.
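As a quick numerical illustration of why a log axis helps, using values that roughly span the price range reported by describe() above:

```r
# A ~10^6-fold spread in prices collapses to a few units on a log10
# axis, so boxes for typical districts are no longer flattened by
# the extreme sales.
prices <- c(100, 126000, 97500000)  # roughly low outlier, median, max
log10(prices)
```

On the linear axis the 97.5M sale dictates the whole scale; on the log axis it sits only a few units above the median.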

london_ppgraph1 = ggplot(data=london_ppdata0, aes(x=district, y=price)) +
  labs(x = 'District / Borough', y = 'Log(Price)', title = 'London house log(price) by district') +
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0), legend.position = 'bottom') +
  scale_y_log10(labels = comma)

london_ppgraph1

We still see a few low-price outliers, with values under 100. Since these are very few and implausibly low, we remove them to further improve the visualization. Removing a fixed share of the highest and lowest values before computing a mean is known as trimmed mean estimation.
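For reference, base R's mean() can compute a trimmed mean directly via its trim argument; a small sketch with illustrative prices:

```r
# `trim` drops the given fraction of observations from each tail
# (floor(n * trim) values per side) before averaging.
prices <- c(1, 95, 120000, 150000, 160000, 200000, 97500000)
mean(prices)              # pulled far up by the 97.5M sale
mean(prices, trim = 0.2)  # drops one value from each tail here
```

With seven observations and trim = 0.2, one value is dropped from each end, leaving the mean of the middle five.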

We will also create a summarized table with the number of observations per district: some districts have few observations, so their summary statistics should be interpreted with caution.

Additionally, it will be useful later to know the median price in pounds (not on the log scale) to better grasp the real house values, so we annotate it on the graph for each district.

Finally, given that we have quite a few districts, the visualization benefits from a horizontal layout with districts sorted by median price.
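The district ordering relies on base R's reorder(); a minimal illustration with toy data:

```r
# reorder() re-levels a factor by a summary statistic of a second
# variable; levels come out in ascending order of that statistic.
df <- data.frame(district = c("A", "A", "B", "B"),
                 price    = c(100, 200, 50, 60))
df$district <- with(df, reorder(district, price, median))
levels(df$district)
```

District B (median 55) is ordered before district A (median 150); combined with coord_flip(), the highest-priced district ends up at the top of the chart.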

# Remove lower outliers
london_ppdata1 = ppdata %>% 
  filter(town_city=='LONDON' & price>100)

# Add count of observations per district
london_ppdata_count = london_ppdata1 %>% 
  group_by(town_city, district) %>% 
  dplyr::summarize(n=n(), mean_price=mean(price), median_price = median(price), .groups='keep')
# reorder the district by highest median
london_ppdata1$district = with(london_ppdata1, reorder(district, price, median))

london_ppgraph2 = ggplot(data=london_ppdata1, aes(x=district, y=price)) +
  labs(x = 'District / Borough', y = 'Price (log scale)', title = 'London house price by district (log scale)') +
  geom_boxplot() +
  geom_text(data = london_ppdata_count, aes(y = 10, label = n), size =2) +
  geom_text(data = london_ppdata_count, aes(y = 20, label = paste0(round(median_price/1000,0),'K')), size =2, color='red') +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0), legend.position = 'bottom') +
  scale_y_log10(labels = comma)+
  coord_flip()

london_ppgraph2

Another way to show this information is a jittered scatter plot with color scaling. For this we need to generate a custom palette, as we require as many as 33 colors.

# create custom palette based on number of districts to graph
colorCount = length(unique(london_ppdata1$district))
getPalette = colorRampPalette(brewer.pal(9,"RdYlBu"))

london_ppgraph3 = ggplot(data=london_ppdata1, aes(x=district, y=price, col=district)) +
  labs(x = 'District / Borough', y = 'Log(Price)', title = 'London house log(price) by district') +
  geom_jitter(size = 0.5) + 
  scale_color_manual(values=getPalette(colorCount)) +
  stat_summary(geom = 'point', fun= 'median', color = 'red', size = 1, alpha=0.8) +
  geom_text(data = london_ppdata_count, aes(y = 101, label = n), size =2, color='black') +
  geom_text(data = london_ppdata_count, aes(y = 201, label = paste0(round(median_price/1000,0),'K')), size =2, color='red') +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0), legend.position = 'none') +
  scale_y_log10(labels = comma, limits = c(100,1e8))+
  coord_flip()
  
london_ppgraph3

Even though there are many districts, faceting could give us additional insight into each of them.

london_ppgraph4 = ggplot(data=london_ppdata1, aes(x='', y=price) ) +
  labs(x = 'District / Borough', y = 'Log(Price)', title = 'London house log(price) by district') +
  geom_boxplot(alpha=0.1) +
  stat_summary(geom = 'point', fun= 'median', color = 'red', size = 1, alpha=0.8) +
  geom_text(data = london_ppdata_count, aes(y = 101, label = n), size=3) +
  geom_text(data = london_ppdata_count, aes(y = 201, label = paste0(round(median_price/1000,0),'K')), size=3, color='red') +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0), legend.position = 'bottom') +
  scale_y_log10(labels = comma, limits = c(100,1e8)) +
  facet_wrap(~district, drop=FALSE, nrow=1, ncol=33)

london_ppgraph4

Faceting doesn’t look very good: it complicates the comparison, and ordering by the median becomes more challenging, so it is not our preferred option. We keep the previous version, with the jittered points and color scaling, as the best option.

Task A2

A2. Could the entire dataset be used to estimate the relationship between price of flats and floor level? If yes, how would you show that relationship in a plot?

The only possible floor-level information seems to be contained in the SAON (Secondary Addressable Object Name) column. However, the entire dataset cannot be used, because the floor level is not always present and, in general, the information in that column is not very consistent.

Let’s analyze a sample of what values we have in that column:

head(unique(ppdata$SAON),100)
##   [1] ""                    "FLAT 13"             "FLAT 10"            
##   [4] "FLAT 4"              "FLAT 31"             "FLAT 23"            
##   [7] "2"                   "FLAT 11"             "FLAT 6"             
##  [10] "FLAT 1"              "3"                   "FLAT 50"            
##  [13] "FLAT 49"             "FLAT 34"             "FLAT 46"            
##  [16] "FLAT 2"              "FLAT 5"              "FLAT 3"             
##  [19] "SECOND FLOOR FLAT 4" "FLAT 17"             "16"                 
##  [22] "18"                  "9"                   "12"                 
##  [25] "FLAT 9"              "FLAT 8"              "1"                  
##  [28] "17"                  "21"                  "28"                 
##  [31] "31"                  "5"                   "4"                  
##  [34] "7"                   "GROUND FLOOR SHOP"   "25"                 
##  [37] "APARTMENT 31"        "APARTMENT 24"        "APARTMENT 11"       
##  [40] "APARTMENT 41"        "APARTMENT 10"        "6"                  
##  [43] "75"                  "6A"                  "10"                 
##  [46] "35"                  "43"                  "32"                 
##  [49] "24"                  "15"                  "FLAT 12"            
##  [52] "FLAT 15"             "36"                  "APARTMENT10"        
##  [55] "APARTMENT 7"         "APARTMENT 2"         "APARTMENT 1"        
##  [58] "APARTMENT 3"         "8"                   "40"                 
##  [61] "42"                  "57"                  "APARTMENT 9"        
##  [64] "11"                  "29"                  "49"                 
##  [67] "63"                  "13"                  "APARTMENT 20"       
##  [70] "APARTMENT 4"         "APARTMENT 8"         "APARTMENT 14"       
##  [73] "APARTMENT 16"        "APARTMENT 23"        "APARTMENT 36"       
##  [76] "APARTMENT 34"        "APARTMENT 6"         "APARTMENT 5"        
##  [79] "APARTMENT 15"        "APARTMENT 19"        "14"                 
##  [82] "19"                  "FLAT 27"             "FLAT 16"            
##  [85] "FLAT 19"             "FLAT 29"             "FLAT 32"            
##  [88] "FLAT 20"             "78"                  "80"                 
##  [91] "58"                  "26"                  "44"                 
##  [94] "27"                  "FLAT 14"             "FLAT 21"            
##  [97] "FLAT 7"              "FLAT 2A"             "FLAT 28"            
## [100] "FLAT 22"
tail(unique(ppdata$SAON),100)
##   [1] "BARN HOUSE"                "THE POLLARDS"             
##   [3] "2 BUNGALOW"                "MILTON COTTAGE"           
##   [5] "DOEBANK"                   "3 ORIEL CHAMBERS"         
##   [7] "MIDDLE WING"               "THE CUMBERLAND SUITE"     
##   [9] "THE WHEATSHEAF SUITE"      "THE MINSTER SUITE"        
##  [11] "WILLOW POOL BARN"          "3 SHERIFFS RIDE"          
##  [13] "1 SHERIFFS RIDE"           "SUNNY VIEW LODGE"         
##  [15] "GABLES END"                "TANNERS LODGE"            
##  [17] "9 BRERETON MEWS"           "4 BRERETON MEWS"          
##  [19] "SYCAMORE APARTMENT"        "PLOT 118"                 
##  [21] "SECOND FLOOR 48"           "33E"                      
##  [23] "31E"                       "4 WILLIAMS COURT"         
##  [25] "GROUND FLOOR FLAT 10"      "2 THE LAURELS"            
##  [27] "1 THE LAURELS"             "THE POOL HOUSE"           
##  [29] "1 COACH VIEW"              "THORNEYCROFT"             
##  [31] "THE DRIFTWAY"              "THE COMBERMERE ARMS"      
##  [33] "RYE CROFT 5"               "BEECH BARN 2"             
##  [35] "THE OLD COLLIERY"          "HAWTHORNS BARN"           
##  [37] "CHURCH VIEW BARN"          "22 NEW WING"              
##  [39] "12 NEW WING"               "9 NEW WING"               
##  [41] "29 SPARROWS NEST"          "THE POND HOUSE"           
##  [43] "RAVEN MEWS"                "ROOM 216"                 
##  [45] "ROOM 11A"                  "THE COACH HOUSE FLAT"     
##  [47] "FIRST-SECOND FLOOR"        "(BIELBY)"                 
##  [49] "HEADMISTRESS HOUSE"        "FLAT N1"                  
##  [51] "FLAT N13"                  "FLAT N3"                  
##  [53] "FLAT W7"                   "FLAT S4"                  
##  [55] "FLAT N11"                  "END COTTAGE"              
##  [57] "FLAT N19"                  "FLAT N23"                 
##  [59] "FLAT N7"                   "FLAT W6"                  
##  [61] "FLAT W11"                  "DANBY HOUSE"              
##  [63] "2 STAR COTTAGES"           "3 STAR COTTAGES"          
##  [65] "WENLOCK"                   "GULL NEWK"                
##  [67] "PUFFIN COTTAGE"            "MEADOWSWEET COTTAGE"      
##  [69] "FLAT 1A & 1-5"             "GROUND FLOOR 1"           
##  [71] "APARTMENT 4, MARINE COURT" "SUMMERHEIM"               
##  [73] "THIMBLE COTTAGE"           "FALT 3"                   
##  [75] "FLAT F16"                  "1 MEWS COTTAGES"          
##  [77] "THE OAK ROOM 14"           "WISTERIA BARN"            
##  [79] "HARRY'S COTTAGE"           "CROSS KEYS COTTAGE"       
##  [81] "1810"                      "APARTMENT 1.10A"          
##  [83] "518"                       "419"                      
##  [85] "616"                       "COB BARN"                 
##  [87] "CLEVELAND HOUSE"           "APARTMENT S"              
##  [89] "ROOM 16"                   "FLAT 3 EASTERN VILLA"     
##  [91] "GALLERY"                   "HOPE HOUSE"               
##  [93] "MILLERS BARN"              "WARKWORTH 1"              
##  [95] "AUTUMN BARN"               "AMBLESIDE"                
##  [97] "TROUTBECK"                 "MUSCOATES BARN"           
##  [99] "ROSEDALE COTTAGE"          "EASTHOLME"

We can also check for NA values and empty strings.

saon_total = nrow(ppdata)
# check for NA
saon_na = ppdata %>% filter(is.na(SAON)) %>% dplyr::summarise(n=n())
saon_na = saon_na/saon_total*100
# check for empty strings
saon_empty = ppdata %>% filter(SAON=='') %>% dplyr::summarise(n=n())
saon_empty = saon_empty/saon_total*100

print(paste0('% of NAs: ', saon_na))
## [1] "% of NAs: 0"
print(paste0('% of empty strings: ', saon_empty))
## [1] "% of empty strings: 88.8127631479257"

Given that the data is very inconsistent and that the share of empty strings is very high (≈89%), we conclude it is not possible to plot a consistent comparison between price and floor level using the full dataset.
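If one nevertheless wanted to salvage the small subset where the floor is spelled out (e.g. "SECOND FLOOR FLAT 4" above), a pattern-matching sketch could look like the following. The word list and helper function are illustrative assumptions, not part of the assignment, and cover only a handful of explicit floor words:

```r
# Map explicit floor words to numeric levels (illustrative, not exhaustive).
floor_words <- c(GROUND = 0, FIRST = 1, SECOND = 2, THIRD = 3, FOURTH = 4)

extract_floor <- function(saon) {
  # Keep the leading floor word if present; otherwise sub() returns the
  # input unchanged, which then fails the name lookup and yields NA.
  word <- sub("^(GROUND|FIRST|SECOND|THIRD|FOURTH) FLOOR.*$", "\\1", saon)
  unname(floor_words[word])
}

extract_floor(c("SECOND FLOOR FLAT 4", "FLAT 13", "GROUND FLOOR FLAT 10"))
```

Only a small minority of SAON values would yield a non-NA floor, which is precisely why the full dataset cannot support this analysis.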

Task B1

B1. Create a GeoJSON file where each postcode is represented with a latitude, longitude value, together with minimum, maximum, mean and median house price

First we create a reduced version of the original data containing only the requested features; this will speed up the file conversion later.

ppdata_geojson = ppdata %>% 
  group_by(postcode, latitude, longitude) %>% 
  dplyr::summarise(max_price=max(price),min_price=min(price),mean_price=mean(price),median_price=median(price), .groups ='keep')

head(ppdata_geojson)
## # A tibble: 6 × 7
## # Groups:   postcode, latitude, longitude [6]
##   postcode latitude longitude max_price min_price mean_price median_price
##   <chr>       <dbl>     <dbl>     <int>     <int>      <dbl>        <dbl>
## 1 AL1 1AJ      51.7    -0.329    435000    101950    203018.       159995
## 2 AL1 1AR      51.7    -0.317    350000    250000    300000        300000
## 3 AL1 1AS      51.7    -0.335    500000    285000    356667.       285000
## 4 AL1 1BH      51.7    -0.338    187000     55500    134500        170000
## 5 AL1 1BX      51.7    -0.341   1200000    270000    731667.       725000
## 6 AL1 1BZ      51.8    -0.340    390000    140000    265000        265000
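For comparison, the same per-postcode summary can be produced with base R's aggregate() (toy data shown here; the real call above also groups on latitude and longitude):

```r
# aggregate() computes the four price statistics per postcode group;
# with a vector-valued FUN the result column is a named matrix.
df <- data.frame(postcode = c("A", "A", "B"),
                 price    = c(100, 300, 250))
stats <- aggregate(price ~ postcode, data = df,
                   FUN = function(p) c(max = max(p), min = min(p),
                                       mean = mean(p), median = median(p)))
stats
```

The dplyr version is preferred in the main analysis for readability and speed on >2M rows, but the result is equivalent.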

Now we convert the resulting data frame to a Spatial object and write out the GeoJSON file.

# convert the longitude and latitude columns to coordinates
# (x = longitude, y = latitude)
coordinates(ppdata_geojson) = c('longitude', 'latitude')

# check whether the geojson file already exists and delete it;
# without this prior step, writeOGR fails if the file already exists.
fn <- 'output.geojson'
if (file.exists(fn)) {
  file.remove(fn)
}
## [1] TRUE
# write geojson file
rgdal::writeOGR(ppdata_geojson, "output.geojson", layer = "postcodes", driver = "GeoJSON")

Task B2

Open the GeoJSON file in the GIS application of your choice and colour-code the data to give an overview of areas with high, medium and low median house price. Additionally, you can visualise this information as choropleths, or use Shiny and add the information as markers on a map for a more interactive and impressive result.

We will not use the GeoJSON file we generated. Instead, we will work with the initial ppdata dataset and a newly imported shapefile that contains the polygons of all UK area codes.

The shapefile was obtained from this url: https://www.opendoorlogistics.com/downloads/

The reason we use this file instead of the one provided is that the shapefile contains polygon data with ids that resemble (though not exactly match, as explained below) the postcodes provided originally.

First we do some housekeeping: we add a new column derived from the original postcode column, containing only the characters up to the first space in the postcode string (the outward code, which we believe corresponds to the area code in the UK). This will be our new dataset id. We also rename the postcode and id columns for convenience.

# Rename columns to reflect new id and postcode data
ppdata = rename(ppdata, long_postcode = postcode)
ppdata = rename(ppdata, old_id = id)

ppdata = ppdata %>% 
  mutate(id=str_split_fixed(long_postcode,' ',2)[,1])

head(ppdata)
##   long_postcode          transaction_unique_identifier  price date_of_transfer
## 1       AL1 1AJ {F42AB372-96E9-4C62-A659-77AC8C0EAEDF} 159995 1997-07-25 00:00
## 2       AL1 1AJ {4911E580-FE15-43E0-9C80-3A36863ABBC7} 300000 2006-09-22 00:00
## 3       AL1 1AJ {656892E1-43C7-4CE5-83A3-28DA979FB9F4} 143995 1997-06-20 00:00
## 4       AL1 1AJ {B3BA577D-02A6-40BA-9C17-2167A5C27794} 435000 2011-06-30 00:00
## 5       AL1 1AJ {A0278D4E-0B75-48A7-AAA2-1587E9A7F9BE} 158950 1997-10-27 00:00
## 6       AL1 1AJ {6A55C1DF-1633-493F-A38B-3204615D3EE5} 139995 1997-06-27 00:00
##   property_type old_new duration PAON SAON       street   locality  town_city
## 1             S       Y        F   28      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 2             T       N        F   21      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 3             S       Y        F   17      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 4             T       N        F   22      ORIENT CLOSE             ST ALBANS
## 5             S       Y        F    2      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 6             S       Y        F   20      ORIENT CLOSE ST. ALBANS ST. ALBANS
##    district        county PPD_category_type record_status  old_id latitude
## 1 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 2 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 3 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 4 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 5 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 6 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
##    longitude  id
## 1 -0.3285997 AL1
## 2 -0.3285997 AL1
## 3 -0.3285997 AL1
## 4 -0.3285997 AL1
## 5 -0.3285997 AL1
## 6 -0.3285997 AL1
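The same outward-code extraction can be done in base R without stringr; a small equivalent sketch:

```r
# The outward code is everything before the first space in a full UK
# postcode; sub() removes the space and all that follows, matching
# str_split_fixed(long_postcode, ' ', 2)[, 1] above.
outward <- function(pc) sub(" .*$", "", pc)
outward(c("AL1 1AJ", "YO90 1WR"))
```

Both approaches leave postcodes without a space unchanged, which is the desired behaviour for already-short codes.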

We now build another aggregated dataset at the short_postcode (area code) level with the mean and median price. This dataset will later be joined with the newly imported shapefile at the area-code level.

# aggregate data at district (id) level
ppdata_agg = ppdata %>% 
  group_by(id) %>% 
  dplyr::summarise(mean_price=mean(price),median_price=median(price), .groups ='keep')

Before we continue let’s have a look at the distribution of the mean price.

Based on what we saw in the first part of the exercise, there were some considerable outliers in the data, so it may be worth reducing their impact by creating bins based on the median price. Later on we will check which visualisation provides better insights.

# plot distribution of the median price with a histogram
plot_median_dist = ggplot(ppdata_agg, aes(x=log(median_price), fill = ..count..)) +
  labs(x = 'log median price', y = 'Count', title = 'UK house log median price distribution at district aggregation') +
  geom_histogram( bins=200) +
  theme(axis.text.x = element_text(vjust = 1, hjust = 0), legend.position = 'right')

# plot distribution of the mean price with a histogram 
plot_mean_dist = ggplot(ppdata_agg, aes(x=log(mean_price), fill = ..count..)) +
  labs(x = 'log mean price', y = 'Count', title = 'UK house log mean price distribution at district aggregation') +
  geom_histogram( bins=200) +
  theme(axis.text.x = element_text(vjust = 1, hjust = 0), legend.position = 'right')

# gridExtra provides grid.arrange (load it if not already attached)
library(gridExtra, quietly = TRUE)
grid.arrange(plot_median_dist, plot_mean_dist, nrow=2)

We can see the heavy right tail, and how the counts are highly concentrated in the middle of the distribution (even more so with the log applied). This confirms that binning could be beneficial.

But how many bins should we create? Let's have a look at the quartiles and create bins for each quartile or half-quartile.

# display quantile information
summary(ppdata_agg$mean_price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    39616   113239   151721   275599   206695 68591328

It seems like half-quartiles would give us a good spread, so let's create the bins for both mean and median and add them to our aggregated dataframe.

# create bins for mean and median price
ppdata_agg = ppdata_agg %>% 
  mutate(mean_price_bins = cut(mean_price, breaks=c(-Inf, 50000,100000,125000,150000,175000,200000,Inf), labels=c('0-50K','50K-100K','100K-125K','125K-150K','150K-175K','175K-200K','200K+'))) %>% 
  mutate(median_price_bins = cut(median_price, breaks=c(-Inf, 50000,100000,125000,150000,175000,200000,Inf), labels=c('0-50K','50K-100K','100K-125K','125K-150K','150K-175K','175K-200K','200K+')))
head(ppdata_agg)
## # A tibble: 6 × 5
## # Groups:   id [6]
##   id    mean_price median_price mean_price_bins median_price_bins
##   <chr>      <dbl>        <dbl> <fct>           <fct>            
## 1 AL1      277663.       225000 200K+           200K+            
## 2 AL10     170007.       169950 150K-175K       150K-175K        
## 3 AL2      252364.       215000 200K+           200K+            
## 4 AL3      299182.       249950 200K+           200K+            
## 5 AL4      269784.       228500 200K+           200K+            
## 6 AL5      384532.       280000 200K+           200K+
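
As a side note, instead of hand-picking the break points, the bin edges could also be derived from the data's own quantiles. A minimal sketch on synthetic prices (not the dataset's actual values):

```r
# Synthetic prices, used only to illustrate quantile-derived bins
prices <- c(45000, 90000, 120000, 140000, 160000, 180000, 210000, 950000)

# derive break points from the quartiles; unique() guards against duplicates
breaks <- unique(quantile(prices, probs = seq(0, 1, 0.25)))
bins <- cut(prices, breaks = breaks, include.lowest = TRUE)

table(bins)  # counts per quartile bin (2 per bin here by construction)
```

This guarantees evenly populated bins, at the cost of break points that are less readable than round numbers like 50K or 200K.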

Now we read a shape file containing the polygons at UK area code level and process it so it can be used as a map in ggplot.

# read shape file and flatten it, using the name attribute in it as region id
uk_sh = rgdal::readOGR('./shape_data/Distribution/Districts.shp')
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/ivanobat/gitrepos/viz/final_project/shape_data/Distribution/Districts.shp", layer: "Districts"
## with 2880 features
## It has 1 fields
uk_map = fortify(uk_sh, region='name')

For ggplot to accept the dataframe as a polygon map we also need to rename the longitude and latitude columns.

# rename longitude and latitude fields to fit ggplot format for polygon maps
uk_map = rename(uk_map, x=long)
uk_map = rename(uk_map, y=lat)
head(uk_map)
##           x        y order  hole piece   id  group
## 1 -2.116455 57.14656     1 FALSE     1 AB10 AB10.1
## 2 -2.116548 57.14663     2 FALSE     1 AB10 AB10.1
## 3 -2.116724 57.14683     3 FALSE     1 AB10 AB10.1
## 4 -2.116614 57.14714     4 FALSE     1 AB10 AB10.1
## 5 -2.115828 57.14766     5 FALSE     1 AB10 AB10.1
## 6 -2.115821 57.14766     6 FALSE     1 AB10 AB10.1

And now we can finally join the aggregated data with the dataframe from the shape file. We will do a left outer join to ensure we don’t miss any area codes, even if there is no house price data for them.

# join polygon dataframe with aggregated data
uk_map = dplyr::left_join(uk_map, ppdata_agg, by='id')

If we now check the results of the left join by looking at NA values in the dataset, we find there are some area codes that did not have any house price information.

# check all fields on aggregated data for NAs and count them
sapply(uk_map, function(x) sum(is.na(x)))
##                 x                 y             order              hole 
##                 0                 0                 0                 0 
##             piece                id             group        mean_price 
##                 0                 0                 0            261157 
##      median_price   mean_price_bins median_price_bins 
##            261157            261157            261157

But since our new map is ready with all the information, we can check that directly on the map.

We will show a map with area code administrative divisions, colour-coded by the log of the median house price. We use the log again to accentuate the differences between areas; otherwise they would all be shown in roughly the same colour. Both versions of the graph are shown below.

plot_median = ggplot(uk_map) +
  ggtitle("UK house median price (lighter color == lower price)", subtitle = 'by districts') +
  geom_polygon(aes(x=x, y=y, fill = median_price , group=group), color = 'gray', size = 0.1) +
  scale_fill_distiller(name='mean', palette='Spectral', breaks=pretty_breaks(n=10), label=comma) +
  coord_quickmap() +
  guides(fill=guide_legend(title='Median Price')) +
  theme_void()

plot_log_median = ggplot(uk_map) +
  ggtitle("UK house median log(price) (lighter color == lower price)", subtitle = 'by districts') +
  geom_polygon(aes(x=x, y=y, fill = log(median_price) , group=group), color = 'gray', size = 0.1) +
  scale_fill_distiller(name='mean', palette='Spectral', breaks=pretty_breaks(n=10)) +
  coord_quickmap() +
  guides(fill=guide_legend(title='Median Log(Price)')) + 
  theme_void()

grid.arrange(plot_median, plot_log_median, ncol=2)

The results are acceptable but, as we suspected, the values are highly concentrated in a very small range compared to the maximum, so it is difficult to discern any differences between the area code districts, even in the log(price) version.

So let's redo this, but this time using the median price bins we created based on the quartile data.

plot_median = ggplot(uk_map) +
  ggtitle("UK house median price (lighter color == lower price)", subtitle = 'by districts') +
  geom_polygon(aes(x=x, y=y, fill = median_price_bins , group=group), color = 'gray', size = 0.1) +
  scale_fill_manual(values = c('yellow1','yellow3', 'orange1','orange2','orange3', 'red1', 'red3','gray')) +
  scale_size_area() +
  coord_quickmap() +
  theme(legend.position='bottom') +
  guides(fill=guide_legend(title="Median Price")) +
  theme_void()

plot_median

It looks much better, and it is now far easier to differentiate between districts.

One more thing we could improve is to crop the map, leaving out the portion we found to be missing after our join, which is not part of our house prices dataset (shown in gray).

plot_median = ggplot(uk_map) +
  ggtitle("UK house median price (lighter color == lower price)", subtitle = 'by districts') +
  geom_polygon(aes(x=x, y=y, fill = median_price_bins , group=group), color = 'gray', size = 0.1) +
  scale_fill_manual(values = c('yellow1','yellow3', 'orange1','orange2','orange3', 'red1', 'red3','gray')) +
  coord_quickmap(ylim=c(50.18,55.66),xlim=c(-7.26,2.11)) +
  theme(legend.position='bottom') +
  guides(fill=guide_legend(title="Median Price")) +
  theme_void()

plot_median

Task B3

Instead of using median price, you could have been asked to colour-code the mean house price. Would that have given a better view of the house prices across the UK? Please justify your answer.

The two would definitely be different, and the reason has to do with how the prices are distributed. If we look at the quantiles for each we can see the initial differences. We believe the median is better, as it preserves a greater amount of detail, especially when viewed using the bins. Given the big outliers we have seen, and situations where a small number of observations drives the mean upward, the mean is not as good at capturing the most typical house price in each area code.
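
As a minimal self-contained illustration of this robustness (the numbers below are synthetic, not taken from the dataset):

```r
# Typical prices plus a single extreme sale, mimicking the outliers above
typical <- c(120000, 135000, 150000, 165000, 180000)
with_outlier <- c(typical, 20000000)  # one 20M transaction

mean(typical)         # 150000
mean(with_outlier)    # about 3.46 million, pulled up by the single outlier
median(typical)       # 150000
median(with_outlier)  # 157500, barely moved
```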

summary(ppdata_agg$mean_price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    39616   113239   151721   275599   206695 68591328
summary(ppdata_agg$median_price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    24000    97500   133000   208412   175000 68591328

We can also appreciate the difference by looking at their log histograms, where it seems small because of the log transformation but is in fact very significant. The mean will tend to fall at higher values, pulled by those extreme outliers, while the median maintains a more robust central position.

ggplot(ppdata_agg) + 
  geom_histogram(aes(x=log(mean_price)),bins=100, alpha=0.5, fill='orange') +
  geom_histogram(aes(x=log(median_price)),bins=100, alpha=0.5, fill='blue') +
  scale_colour_manual(name = '', values =c('orange'='orange','red'='red'), labels = c('mean','median'), guide='legend')

Below are the choropleths using the median (left) vs the mean (right). For example, it is clear how a larger area is marked as high cost when using the mean, which is not necessarily as relevant, as it may be driven by only a few observations.

plot_mean = ggplot(uk_map) +
  ggtitle("UK house mean price (lighter color == lower price)", subtitle = 'by districts') +
  geom_polygon(aes(x=x, y=y, fill = mean_price_bins , group=group), color = 'gray', size = 0.1) +
  scale_fill_manual(values = c('yellow1','yellow3', 'orange1','orange2','orange3', 'red1', 'red3','gray')) +
  coord_quickmap(ylim=c(50.18,55.66),xlim=c(-7.26,2.11)) +
  theme(legend.position='bottom') +
  guides(fill=guide_legend(title="Mean Price")) +
  theme_void()

grid.arrange(plot_median, plot_mean, ncol=2)

Task C1

Examine the house prices for 2015. How do these change over time? Do property prices seem to increase or decrease throughout the year?

First we will convert our date to date data type as it is originally in string format.

# Convert to date data type (str_sub comes from stringr)
library(stringr, quietly = TRUE)
ppdata = ppdata %>% mutate(date_of_transfer = as.Date(str_sub(date_of_transfer,1,10),format='%Y-%m-%d'))
head(ppdata)
##   long_postcode          transaction_unique_identifier  price date_of_transfer
## 1       AL1 1AJ {F42AB372-96E9-4C62-A659-77AC8C0EAEDF} 159995       1997-07-25
## 2       AL1 1AJ {4911E580-FE15-43E0-9C80-3A36863ABBC7} 300000       2006-09-22
## 3       AL1 1AJ {656892E1-43C7-4CE5-83A3-28DA979FB9F4} 143995       1997-06-20
## 4       AL1 1AJ {B3BA577D-02A6-40BA-9C17-2167A5C27794} 435000       2011-06-30
## 5       AL1 1AJ {A0278D4E-0B75-48A7-AAA2-1587E9A7F9BE} 158950       1997-10-27
## 6       AL1 1AJ {6A55C1DF-1633-493F-A38B-3204615D3EE5} 139995       1997-06-27
##   property_type old_new duration PAON SAON       street   locality  town_city
## 1             S       Y        F   28      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 2             T       N        F   21      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 3             S       Y        F   17      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 4             T       N        F   22      ORIENT CLOSE             ST ALBANS
## 5             S       Y        F    2      ORIENT CLOSE ST. ALBANS ST. ALBANS
## 6             S       Y        F   20      ORIENT CLOSE ST. ALBANS ST. ALBANS
##    district        county PPD_category_type record_status  old_id latitude
## 1 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 2 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 3 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 4 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 5 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
## 6 ST ALBANS HERTFORDSHIRE                 A             A 1578238 51.74448
##    longitude  id
## 1 -0.3285997 AL1
## 2 -0.3285997 AL1
## 3 -0.3285997 AL1
## 4 -0.3285997 AL1
## 5 -0.3285997 AL1
## 6 -0.3285997 AL1

Then we set date limits on the dataset to focus on 2015.

# Set date limits
startDate = as.Date('2015-01-01','%Y-%m-%d')
endDate = as.Date('2016-01-01','%Y-%m-%d') 

# Subset the original dataset and group by relevant features
ppdata_ts = ppdata %>% 
  filter(date_of_transfer>=startDate & date_of_transfer<endDate) %>%
  group_by(date_of_transfer) %>% 
  dplyr::summarise(mean_price=mean(price),median_price=median(price), .groups ='keep')

head(ppdata_ts)
## # A tibble: 6 × 3
## # Groups:   date_of_transfer [6]
##   date_of_transfer mean_price median_price
##   <date>                <dbl>        <dbl>
## 1 2015-01-01          130250        130250
## 2 2015-01-02          244233.       188250
## 3 2015-01-04          380667.       237000
## 4 2015-01-05          232174.       182500
## 5 2015-01-06          352016.       245000
## 6 2015-01-07          304544.       230000

And now we can graph the data for that year and check for anomalies, focusing on mean prices first.

ppdata_ts_plot = ggplot(ppdata_ts, aes(x=date_of_transfer, y=mean_price)) +
  labs(x = 'Date', y = 'Mean(price)', title = 'UK house mean prices by day in 2015') +
  geom_line(stat = 'summary', fun = 'mean', colour = 'black') +
  geom_smooth(formula=y~x, method = 'loess', se = FALSE, span = 0.2) +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0, size=10), legend.position = 'bottom')

ppdata_ts_plot

We see a huge spike in the mean price on December 29, 2015. It would be interesting to know why the mean is so high on that day.

In any case, at the daily level it is very hard to see any representative seasonality in the mean price, so we will aggregate further to the month level.

# aggregate data to month level
ppdata_ts = ppdata %>%
  filter(date_of_transfer>=startDate & date_of_transfer<endDate) %>%
  mutate(year_month = paste0(format(date_of_transfer, '%Y'),format(date_of_transfer, '%m'))) %>%
  mutate(year_month = as.Date(paste0(year_month,'01'),'%Y%m%d')) %>%
  group_by(year_month) %>%
  summarise(mean_price = mean(price), median_price = median(price), n=n(), .groups='keep')
# preview monthly data aggregation
ppdata_ts
## # A tibble: 12 × 4
## # Groups:   year_month [12]
##    year_month mean_price median_price     n
##    <date>          <dbl>        <dbl> <int>
##  1 2015-01-01    281680.       195000  7308
##  2 2015-02-01    268930.       190000  7510
##  3 2015-03-01    312461.       190000  8709
##  4 2015-04-01    292262.       196500  8304
##  5 2015-05-01    274576.       197000  9353
##  6 2015-06-01    291203.       204950 10807
##  7 2015-07-01    316227.       210000 11612
##  8 2015-08-01    294418.       217500 10597
##  9 2015-09-01    301187.       214950 10337
## 10 2015-10-01    286891.       210000 11102
## 11 2015-11-01    292719.       206995 10069
## 12 2015-12-01    317948.       207000  9867
ppdata_ts_plot = ggplot(ppdata_ts) +
  labs(x = 'Date', y = 'Sales operations', title = 'UK house sales operations by month in 2015') +
  geom_bar(aes(x=year_month, y=n),stat='identity') +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0, size=10), legend.position = 'bottom') +
  geom_text(aes(y = 200, label = paste0("n = ", n), x=year_month), size=2) +
  scale_y_continuous(labels=comma)

ppdata_ts_plot

ppdata_ts_plot = ggplot(ppdata_ts) +
  labs(x = 'Date', y = 'Mean(price) and Median(price)', title = 'UK house mean and median prices by month in 2015') +
  geom_line(aes(x=year_month,y=mean_price, colour = 'orange')) +
  geom_line(aes(x=year_month,y=median_price, colour = 'red'), ) +
  scale_colour_manual(name = '', values =c('orange'='orange','red'='red'), labels = c('mean','median'), guide='legend') +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0, size=10), legend.position='bottom') +
  scale_y_continuous(labels=comma)

ppdata_ts_plot

There are no obvious anomalies in the mean and median price data at the month level, nor in the number of transactions per month. So let's go back and revisit the peak we had on December 29, 2015.

# Set date limits
startDate = as.Date('2015-12-29','%Y-%m-%d')
endDate = as.Date('2015-12-30','%Y-%m-%d') 

# Subset the original dataset and group by relevant features
ppdata_ts = ppdata %>% 
  filter(date_of_transfer>=startDate & date_of_transfer<endDate)

ppdata_ts_plot = ggplot(ppdata_ts, aes(x=district, y=price)) +
  labs(x = 'District', y = 'Mean(price)', title = 'UK house mean prices by district on 2015-DEC-29') +
  geom_bar(stat = 'summary', fun = 'mean', colour = 'black') +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0, size=7), legend.position = 'bottom') +
  scale_y_continuous(labels = comma)
  
ppdata_ts_plot

There seems to be a 20 million sale in a specific district; let's analyse the data further.

ppdata_ts = ppdata %>% 
  filter(date_of_transfer>=startDate & date_of_transfer<endDate & district == 'SOUTH GLOUCESTERSHIRE')

ppdata_ts
##   long_postcode          transaction_unique_identifier    price
## 1      BS32 4TE {2AC10E50-54D7-1AF6-E050-A8C063052BA1} 19900000
##   date_of_transfer property_type old_new duration PAON SAON     street
## 1       2015-12-29             O       N        F 2000      AZTEC WEST
##      locality town_city              district                county
## 1 ALMONDSBURY   BRISTOL SOUTH GLOUCESTERSHIRE SOUTH GLOUCESTERSHIRE
##   PPD_category_type record_status  old_id latitude longitude   id
## 1                 B             A 1516327 51.54334 -2.571513 BS32

So there is only one sale on that day in that district, which accounts for the huge spike. Let's locate where this sale happened exactly, along with other sales above 19 million.

# filter dataset to show only sales operations above 19 million
uk_lf = ppdata %>% 
  select(latitude, longitude, price) %>% 
  filter(price>19000000)
# map sales operations above 19 million
map = uk_lf %>% 
  leaflet(options =leafletOptions(dragging=FALSE)) %>% 
  addTiles() %>% 
  addCircleMarkers(~longitude,~latitude, color='blue') %>%
  addCircleMarkers(-2.571513, 51.54334,  color='red')

map

The blue circles mark prices above 19 million; the red one marks the sale we analysed from December 29, 2015.

Task C2

C2. Is there a significant relationship between the price of a property and the time of year it is sold? Does this vary with type of property?

In terms of seasonality, let's first look at a full temporal graph of the complete dataset (a day-level time series), calculating the median and mean price for each date.

# create aggregated table by day
ppdata_dyg = ppdata %>%
  select(date_of_transfer, price) %>%
  group_by(date_of_transfer) %>% 
  dplyr::summarise(mean_price = mean(price), median_price=median(price), .groups='keep')

# transform median price data to time series format to be used with dygraph
library(xts, quietly = TRUE)
library(dygraphs, quietly = TRUE)
ppdata_dyg=xts(ppdata_dyg[,-1], ppdata_dyg$date_of_transfer)

We will plot the time series for mean and median prices and annotate some relevant data points (day level).

ppdata_dyg_plot = dygraph(data=ppdata_dyg, main = 'House mean and median price in the UK', xlab='Date', ylab='Price') %>%
  dySeries("mean_price", drawPoints = TRUE, pointShape = "square", color = "blue") %>%
  dyAnnotation("2015-12-29", text = "spike 2015", tooltip = "case reviewed in exercise") %>%
  dyAnnotation("2014-08-02", text = "biggest mean", tooltip = "biggest mean") %>%
  dySeries("median_price", drawPoints = TRUE, pointShape = "square", color = "green") %>%
  dyAnnotation("2011-06-26", text = "biggest median", tooltip = "biggest median") %>%
  dyRangeSelector() %>%
  dyAxis('x', label = "Date") %>%
  dyAxis('y', axisLabelFormatter='function(v){return (v).toFixed(0)}') %>%
  dyOptions(axisLabelFontSize=10, fillGraph = FALSE, fillAlpha=0.1, drawPoints=TRUE)

ppdata_dyg_plot

With those big outliers in the mean price it is hard to appreciate any trend, so let's clip the outliers. We will clip out any house price above 4 million for the purpose of this exercise.

# filter out any mean price above 4 million
ppdata_dyg = ppdata %>%
  select(date_of_transfer, price) %>%
  filter(price < 4000000) %>%
  group_by(date_of_transfer) %>% 
  dplyr::summarise(mean_price=mean(price), median_price = median(price), .groups='keep')

# transform to time series
ppdata_dyg=xts(ppdata_dyg[,-1], ppdata_dyg$date_of_transfer)

ppdata_dyg_plot = dygraph(data=ppdata_dyg, main = 'House mean and median price in the UK', xlab='Date', ylab='Price') %>%
  dySeries("mean_price", drawPoints = TRUE, pointShape = "square", color = "blue") %>%
  dySeries("median_price", drawPoints = TRUE, pointShape = "square", color = "green") %>%
  dyRangeSelector() %>%
  dyAxis('x', label = "Date") %>%
  dyAxis('y', axisLabelFormatter='function(v){return (v).toFixed(0)}') %>%
  dyOptions(axisLabelFontSize=10, fillGraph = FALSE, fillAlpha=0.1, drawPoints=TRUE)
ppdata_dyg_plot

And let’s finally see if the log version of the median and mean can give us any additional insight.

# aggregate log mean and median at day level
ppdata_dyg = ppdata %>%
  select(date_of_transfer, price) %>%
  group_by(date_of_transfer) %>% 
  dplyr::summarise(mean_price = mean(log(price)), median_price = median(log(price)), .groups='keep')

# transform to time series
ppdata_dyg=xts(ppdata_dyg[,-1], ppdata_dyg$date_of_transfer)

ppdata_dyg_plot = dygraph(data=ppdata_dyg, main = 'House price in the UK', xlab='Date', ylab='Price') %>%
  dySeries("mean_price", drawPoints = TRUE, pointShape = "square", color = "blue") %>%
  dySeries("median_price", drawPoints = TRUE, pointShape = "square", color = "green") %>%
  dyRangeSelector() %>%
  dyAnnotation('2015-12-29', text = 'Spike', tooltip = 'The case we spotted in 2015') %>%
  dyAxis('x', label = "Date") %>%
  dyAxis('y', axisLabelFormatter='function(v){return (v).toFixed(0)}') %>%
  dyOptions(axisLabelFontSize=10, fillGraph = FALSE, fillAlpha=0.1, drawPoints=TRUE)
ppdata_dyg_plot

The logs of the mean and median prices are definitely not very insightful. We will keep the original clipped version of the mean and median values.

However, even in that graph, there is no seasonality in the house prices that can be clearly observed at the day level. Perhaps this is due to having too many data points, so let's see the results if we aggregate by month and graph again.

ppdata_ts = ppdata %>%
  mutate(year_month = paste0(format(date_of_transfer, '%Y'),format(date_of_transfer, '%m'))) %>%
  mutate(year_month = as.Date(paste0(year_month,'01'),'%Y%m%d')) %>%
  group_by(year_month) %>%
  summarise(mean_price = mean(price), median_price=median(price), .groups='keep')

ppdata_ts_plot = ggplot(ppdata_ts, aes(x=year_month)) +
  labs(x = 'Month', y = 'Mean and median price', title = 'UK house mean and median prices by month') +
  geom_line(aes(x=year_month, y=mean_price), colour = 'blue') +
  geom_line(aes(x=year_month, y=median_price), colour = 'green') +
  geom_smooth(aes(x=year_month, y=mean_price, color='blue3'), formula=y~x, method = 'loess', se = FALSE, span = 0.2) +
  geom_smooth(aes(x=year_month, y=median_price, color='green3'), formula=y~x, method = 'loess', se = FALSE, span = 0.2) +
  scale_colour_manual(name = '', values =c('blue3'='blue3','green3'='green3'), labels = c('mean','median'), guide='legend') +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0, size=8), legend.position = 'bottom') +
  scale_y_continuous(labels=comma) +
  scale_x_date(date_breaks='1 year')
  
ppdata_ts_plot

Looking at the mean and median prices at the month level, the uptrend is clearer, but there is not much seasonality in the data either, or at least none easily spotted at first sight.

Let’s look not at price but at sales operations per month.

# aggregate count data to month level
ppdata_ts = ppdata %>%
  mutate(year_month = paste0(format(date_of_transfer, '%Y'),format(date_of_transfer, '%m'))) %>%
  mutate(year_month = as.Date(paste0(year_month,'01'),'%Y%m%d')) %>%
  group_by(year_month) %>%
  summarise(price = mean(price),n=n(), .groups='keep')

ppdata_ts_plot = ggplot(ppdata_ts, aes(x=year_month, y=n)) +
  labs(x = 'Month', y = 'Sales operations', title = 'UK house sales operations per month') +
  geom_line(aes(x=year_month, y=n),colour = 'black') +
  geom_smooth(formula=y~x, method = 'loess', se = FALSE, span = 0.2) +
  theme(axis.text.x = element_text(angle = -90, vjust = 1, hjust = 0, size=8)) +
  scale_y_continuous(labels=comma) +
  scale_x_date(date_breaks='1 year')
  
ppdata_ts_plot

But when we look at the number of transactions there is definitely a repeating pattern, which seems to indicate that sales drop at the beginning of each year.
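
One way to make that pattern explicit is to pool all years and average the number of transactions per calendar month. A sketch of the technique on a synthetic daily sequence (with the real data, `dates` would be `ppdata$date_of_transfer`):

```r
# synthetic daily dates standing in for transaction dates
dates <- seq(as.Date('2013-01-01'), as.Date('2015-12-31'), by = 'day')

# count observations per year-month, then average across years per month
n_per_ym <- table(format(dates, '%Y-%m'))
month_of <- substr(names(n_per_ym), 6, 7)
avg_by_month <- tapply(as.numeric(n_per_ym), month_of, mean)

round(avg_by_month, 1)  # one average per calendar month, '01' through '12'
```

With the real transaction dates, a consistently lower average for January would confirm the dip is seasonal rather than a one-off.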

In terms of property type we have the following values: D = Detached, S = Semi-Detached, T = Terraced, F = Flats/Maisonettes, O = Other. Let's see if any of these are correlated with house price. We will also use the old_new feature, which indicates whether the property is newly built.

# print out the unique values for the property type dimension
unique(ppdata$property_type)
## [1] "S" "T" "F" "D" "O"
# apply new friendly names to property types
ppdata_new = ppdata %>%
    mutate(property_type_new = case_when(
      property_type == 'D' ~ 'Detached',
      property_type == 'S' ~ 'Semi-Detached',
      property_type == 'T' ~ 'Terraced',
      property_type == 'F' ~ 'Flats/Maisonettes',
      property_type == 'O' ~ 'Other',
      TRUE ~ 'NA'))

# Count observations and price summaries per property type and old/new flag
ppdata_count = ppdata_new %>% 
  group_by(property_type_new, old_new) %>% 
  dplyr::summarize(n=n(), mean_price=mean(price), median_price=median(price), .groups='keep')
ppdata_ts_plot = ggplot(data=ppdata_new, aes(x=old_new, y=price, color=old_new)) +
  labs(x = 'Is property new?', y = 'Price (log scale)', title = 'UK house prices by property type', color='Is property new?') +
  geom_jitter( size = 1, alpha=0.1) + 
  theme(axis.text.x = element_text(vjust = 1, hjust = 0), legend.position = 'bottom') +
  scale_y_log10(labels = comma, limits = c(100,1e8)) +
  facet_wrap(~property_type_new, drop=FALSE, nrow=1, ncol=5)

ppdata_ts_plot

Based on the graph above, the property log price only seems to be higher for the classification “Other”, which at the same time has a much larger number of transactions for established residential buildings (not new). The rest of the property types do not seem to show a considerable difference in price from a visual assessment.
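
As a final note on C2's question about significance, the visual assessment could be backed by a formal test. Below is a hedged sketch using a Kruskal–Wallis test (distribution-free, so robust to the skew we have seen) on synthetic data; with the real data the formula would be something like `price ~ factor(format(date_of_transfer, '%m'))` on ppdata:

```r
set.seed(42)

# synthetic log-normal prices with no built-in month effect
month <- factor(rep(sprintf('%02d', 1:12), each = 50))
price <- rlnorm(600, meanlog = 12, sdlog = 0.5)

# distribution-free test of whether price differs across months
res <- kruskal.test(price ~ month)
res$p.value  # a small p-value would indicate a month-of-year effect
```

The same test could be run within each property type subset to check whether any seasonal effect varies by type.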